gl clone: clean partial clone of repos with private subtrees#33
gl clone: clean partial clone of repos with private subtrees#33beardthelion wants to merge 7 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds backend helpers and a new REST endpoint exposing per-caller withheld and reinclude path globs, and a ChangesBackend Withheld Paths Support
CLI Clone Command with Partial Clone Support
Sequence Diagram(s)sequenceDiagram
participant CLI as gl CLI
participant Node as Gitlawb Node (/withheld-paths)
participant Git as git (local)
participant Remote as Remote Git server
CLI->>Node: GET /api/v1/repos/{owner}/{repo}/withheld-paths
Node-->>CLI: JSON { repo, withheld, reinclude }
CLI->>Git: git clone --filter=blob:none --no-checkout (or full clone)
Git->>Remote: fetch objects and refs
CLI->>Git: configure sparse-checkout (include-all, negate withheld, add reinclude)
Git->>Remote: checkout requested branch (may use origin/HEAD)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/api/visibility.rs`:
- Around line 204-212: The handler currently returns withheld metadata without
verifying the caller can read the repository root; before calling
crate::visibility::withheld_globs or returning the JSON, perform a root-read
gate by calling the existing visibility_check (or equivalent) with path "/"
using the same auth, rules, record.is_public, and record.owner_did; if
visibility_check(..., "/") denies access, return the repo-not-found/unauthorized
response used elsewhere (same error path as other root-read failures) instead of
returning the withheld data. Ensure this check is placed after fetching rules
(list_visibility_rules) but before invoking withheld_globs and constructing the
Json response.
In `@crates/gitlawb-node/src/visibility.rs`:
- Around line 114-119: The current withheld_globs filtering uses a single
representative probe (glob_prefix) and calls visibility_check, which
misclassifies a denied parent when a more-specific child rule allows access;
update the logic in withheld_globs (or change its API contract) to detect
allow-overrides by checking for any more-specific descendant allow before
emitting the parent glob: either (A) when
visibility_check(glob_prefix(&r.path_glob)) == Decision::Deny, scan rules for
any child globs under r.path_glob that yield Decision::Allow (using
visibility_check on those child probes) and skip emitting the parent if any
exist, or (B) change the return structure to emit tuples like (denying_glob,
allowed_exceptions) so the caller can re-include allowed descendants; reference
visibility_check, glob_prefix, Decision::Deny, and the withheld_globs helper to
locate where to implement this.
In `@crates/gl/src/clone.rs`:
- Around line 133-144: The current parsing uses split_once('/') which accepts
inputs like "owner/name/extra"; update the parsing logic in clone.rs (the block
that computes owner and name from stripped) to reject any repo string containing
more than one slash: after trimming trailing slashes and calling
split_once('/'), validate that neither owner nor name contains another '/' (or
alternately check that the count of '/' is exactly one) and bail! with the same
error message if extra slashes are present so malformed inputs like
"owner/name/extra" fail fast instead of producing an invalid repo identifier.
- Around line 147-171: fetch_withheld currently swallows all errors and returns
an empty Vec, which allows clones to proceed when the withheld-paths endpoint
actually failed; change fetch_withheld to return a Result<Vec<String>,
anyhow::Error> (or your crate's error type) instead of Vec<String>, and
propagate any network/HTTP/json errors to the caller except for an explicit
“endpoint unsupported” HTTP status (e.g. 404 Not Found or 501 Not Implemented)
where you should return Ok(Vec::new()). Concretely: update fetch_withheld
signature, stop using unwrap_or_default on resp.json(), treat Err(resp) and
non-success statuses as Err unless status is 404/501 (in which case return
Ok([])), and update callers of fetch_withheld to handle the Result; refer to
load_keypair_from_dir, NodeClient::get_signed/get and the path
"/api/v1/repos/{owner}/{name}/withheld-paths" to locate the code to change.
- Around line 112-121: The current logic in clone.rs extracts the default branch
by parsing stdout from `git remote show origin` (variables `out`, `text`,
`head`) and does not check `out.status.success()`, which breaks for localized
output or non-zero exit; replace this with a refs-based lookup by running `git
symbolic-ref --short refs/remotes/origin/HEAD` (or equivalent) and check the
child process exit status (`out.status.success()`), then read and trim stdout to
produce the branch name, returning an error if the command fails or stdout is
empty instead of relying on localized `HEAD branch:` parsing.
- Around line 99-105: The loop in setup_partial_clone that converts
withheld_globs currently always strips a trailing "**" and emits only a
directory-exclude like "!{dir}/", which fails to exclude the exact path when the
upstream sent "/prefix" (no "/**"). Change the logic inside the for g in
withheld_globs loop: detect whether g originally ended with "**" (or "/**"); if
it did keep the existing behavior (emit "!{dir}/"), but if it did not then emit
two exclude lines for that dir—one for the exact path ("!{dir}") and one for the
subtree ("!{dir}/")—so the sparse-checkout encoding matches visibility_check
semantics; update/add a unit/integration test that calls setup_partial_clone (or
the surrounding code) with a withheld glob like "/docs/private" to assert both
the exact path and subtree are withheld.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 43094733-35b0-4310-8cae-ee86146445f9
📒 Files selected for processing (5)
crates/gitlawb-node/src/api/visibility.rscrates/gitlawb-node/src/server.rscrates/gitlawb-node/src/visibility.rscrates/gl/src/clone.rscrates/gl/src/main.rs
Three fixes from the PR Gitlawb#33 review: - withheld_paths now applies the whole-repo "/" read gate (returns repo-not-found when the caller cannot read the root), matching the git read endpoints. Without it the endpoint disclosed a private repo's existence and path layout to unauthorized callers. The withheld_globs doc already assumed this gate existed; now it does. - A nested allow under a denied parent (e.g. "/secret/public/**" allowed, "/secret/**" denied) was over-withheld: the client sparse-excluded the whole parent and hid paths the caller may read. The endpoint now also returns a "reinclude" list (allowed globs strictly under a denied one) and gl clone re-includes them in the sparse spec after the excludes. - Wildcard-free globs like "/docs/private" match both the exact path and a subtree (per glob_matches), but the client only emitted the subtree exclude. sparse_patterns now emits both "/docs/private" and "/docs/private/". Verified the exclude-then-reinclude sparse ordering checks out cleanly with real git, plus unit tests for reincluded_globs, the nested re-include, the exact-path exclude, and sparse_patterns.
…eld-path errors
split_once('/') accepted owner/name/extra, smuggling a path segment into
the repo name that then flowed into the API path and clone URL; reject it.
fetch_withheld swallowed every network/auth/5xx/JSON error into an empty
result, dropping to a stock clone that the node refuses once blobs are
withheld. Now only 404/501 (endpoint unsupported) fall back to empty; the
rest propagate so the real cause surfaces.
Add a regression test for deny /secret, allow /secret/public, deny /secret/public/admin and clarify the sparse-checkout comment. git does not re-traverse an explicitly excluded directory, so emitting all excludes before re-includes keeps the deepest deny in force; the broader parent re-include does not resurrect it.
…hema Read the default branch from the local origin/HEAD symref clone already set, instead of parsing the localized, network-dependent output of git remote show origin. Deserialize the withheld-paths body into a typed struct so a missing or mistyped withheld/reinclude field is a hard error rather than silently becoming an empty list, which would mask a server regression behind a later clone failure.
Pairs with #28 (subtree content withholding). That PR makes the node withhold private blob bytes from the served pack; the resulting pack is not closed under reachability, so a stock
git cloneis refused at fetch unless the user manually passes--filter. This adds the client-side piece so a non-reader gets a clean checkout with one command.What's here
GET /api/v1/repos/{owner}/{repo}/withheld-paths: returns only the path globs the (optionally authenticated) caller is denied. Not owner-gated and never exposes reader DIDs, unlikelist_visibility. Computed from the existingvisibility_checkvia a new purewithheld_globs.gl clone <gitlawb://owner/name | owner/name> [dir]: asks the node what's withheld for the caller, then clones as a promisor (git clone --filter=blob:none --no-checkout) and sparse-excludes those globs before checkout. Public files land, private paths are absent, the tree entries and SHAs stay visible, no error.A
connect-mode helper can't influence git's own clone logic, so the orchestration lives ingl clonerather than trying to make a baregit clone gitlawb://...work without flags (noted as a follow-up).Notes
withheld-pathsreturns[]andgl clonebehaves like a normal clone. It starts doing real work once feat(node): subtree content withholding (Phase 3) for #18 #28 lands.--filter=blob:noneis the exact client invocation feat(node): subtree content withholding (Phase 3) for #18 #28's node test already proved accepts the withholding pack.--no-conesparse mode is required for the negated!/dir/excludes.Test plan
cargo test -p gl -p gitlawb-node(new:withheld_globs,gl cloneorchestration against afile://bare, repo-arg parsing)/secretmode-B repo,gl cloneit and confirmpublic/present,secret/absent from the worktree, secret path + SHA still ingit ls-tree.Summary by CodeRabbit
New Features
Tests